
    Block CUR: Decomposing Matrices using Groups of Columns

    A common problem in large-scale data analysis is to approximate a matrix using a combination of specifically sampled rows and columns, known as CUR decomposition. Unfortunately, in many real-world environments, the ability to sample specific individual rows or columns of the matrix is limited by either system constraints or cost. In this paper, we consider matrix approximation by sampling predefined blocks of columns (or rows) from the matrix. We present an algorithm for sampling useful column blocks and provide novel guarantees for the quality of the approximation. This algorithm has applications in problems as diverse as biometric data analysis and distributed computing. We demonstrate the effectiveness of the proposed algorithms for computing the Block CUR decomposition of large matrices in a distributed setting with multiple nodes in a compute cluster, where such blocks correspond to columns (or rows) of the matrix stored on the same node, which can be retrieved with much less overhead than retrieving individual columns stored across different nodes. In the biometric setting, the rows correspond to different users and the columns correspond to users' biometric reactions to external stimuli, e.g., watching video content, at a particular time instant. There is significant cost in acquiring each user's reaction to lengthy content, so we sample a few important scenes to approximate the biometric response. An individual time sample in this use case cannot be queried in isolation due to the lack of context that caused that biometric reaction. Instead, collections of time segments (i.e., blocks) must be presented to the user. The practical application of these algorithms is shown via experimental results using real-world user biometric data from a content testing environment. Comment: shorter version to appear in ECML-PKDD 201
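
    As a rough illustration of the block-sampling idea (not the paper's exact algorithm or guarantees), the sketch below samples column blocks with probability proportional to their squared Frobenius norm and then fits the remaining factor by least squares; all function and parameter names are invented for this example.

```python
import numpy as np

def block_column_approx(A, block_size, num_blocks, rng=None):
    """Approximate A by projecting onto a few sampled column blocks.

    Blocks are sampled proportionally to their squared Frobenius norm,
    a simple importance heuristic standing in for the paper's sampling
    distribution.
    """
    rng = np.random.default_rng(rng)
    n = A.shape[1]
    starts = np.arange(0, n, block_size)
    weights = np.array([np.linalg.norm(A[:, s:s + block_size]) ** 2 for s in starts])
    probs = weights / weights.sum()
    chosen = rng.choice(len(starts), size=num_blocks, replace=False, p=probs)
    cols = np.concatenate([np.arange(s, min(s + block_size, n)) for s in starts[chosen]])
    C = A[:, cols]                  # the sampled column blocks
    U = np.linalg.pinv(C) @ A       # best approximation of A within span(C)
    return C, U                     # A is approximated by C @ U

# toy usage
A = np.random.default_rng(0).standard_normal((100, 60))
C, U = block_column_approx(A, block_size=10, num_blocks=3)
rel_err = np.linalg.norm(A - C @ U) / np.linalg.norm(A)
```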

    Optimal CUR Matrix Decompositions

    The CUR decomposition of an $m \times n$ matrix $A$ finds an $m \times c$ matrix $C$ with a subset of $c < n$ columns of $A$, together with an $r \times n$ matrix $R$ with a subset of $r < m$ rows of $A$, as well as a $c \times r$ low-rank matrix $U$ such that the matrix $CUR$ approximates the matrix $A$, that is, $\|A - CUR\|_F^2 \le (1+\epsilon) \|A - A_k\|_F^2$, where $\|\cdot\|_F$ denotes the Frobenius norm and $A_k$ is the best $m \times n$ matrix of rank $k$ constructed via the SVD. We present input-sparsity-time and deterministic algorithms for constructing such a CUR decomposition where $c = O(k/\epsilon)$, $r = O(k/\epsilon)$, and $\mathrm{rank}(U) = k$. Up to constant factors, our algorithms are simultaneously optimal in $c$, $r$, and $\mathrm{rank}(U)$. Comment: small revision in lemma 4.
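
    To make the roles of $C$, $U$ and $R$ concrete, here is a baseline randomized CUR (norm-based sampling with $U = C^+ A R^+$ truncated to rank $k$); it is only an illustrative sketch, not the paper's optimal, input-sparsity-time construction, and the sample sizes are arbitrary.

```python
import numpy as np

def baseline_cur(A, c, r, k, rng=None):
    """Sample c columns and r rows by squared-norm importance, then set
    U = pinv(C) @ A @ pinv(R), truncated to rank k. A simple baseline,
    not the paper's optimal construction."""
    rng = np.random.default_rng(rng)
    col_p = np.linalg.norm(A, axis=0) ** 2
    row_p = np.linalg.norm(A, axis=1) ** 2
    cols = rng.choice(A.shape[1], size=c, replace=False, p=col_p / col_p.sum())
    rows = rng.choice(A.shape[0], size=r, replace=False, p=row_p / row_p.sum())
    C, R = A[:, cols], A[rows, :]
    U = np.linalg.pinv(C) @ A @ np.linalg.pinv(R)
    # enforce rank(U) = k, as in the guarantee stated above
    Uu, Us, Uvt = np.linalg.svd(U, full_matrices=False)
    U_k = (Uu[:, :k] * Us[:k]) @ Uvt[:k, :]
    return C, U_k, R

A = np.random.default_rng(1).standard_normal((300, 200))
C, U, R = baseline_cur(A, c=40, r=40, k=10)
err = np.linalg.norm(A - C @ U @ R, "fro")
```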

    Variant Ranker: a web-tool to rank genomic data according to functional significance

    BACKGROUND: The increasing volume and complexity of high-throughput genomic data make analysis and prioritization of variants difficult for researchers with limited bioinformatics skills. Variant Ranker allows researchers to rank identified variants and determine the most confident variants for experimental validation. RESULTS: We describe Variant Ranker, a user-friendly, simple web-based tool for ranking, filtering and annotation of coding and non-coding variants. Variant Ranker facilitates the identification of causal variants based on novelty, effect and annotation information. The algorithm implements and aggregates multiple prediction algorithm scores, conservation scores, allelic frequencies, clinical information and additional open-source annotations using accessible databases via ANNOVAR. The available information for a variant is transformed into user-specified weights, which are in turn encoded into the ranking algorithm. Through its different modules, users can (i) rank a list of variants, (ii) perform genotype filtering for case-control samples, (iii) filter large amounts of high-throughput data based on custom filter requirements and apply different models of inheritance, and (iv) perform downstream functional enrichment analysis through network visualization. Using networks, users can identify clusters of genes that belong to multiple ontology categories (such as pathways, gene ontology and disease categories) and therefore expedite scientific discoveries. We demonstrate the utility of Variant Ranker to identify causal genes using real and synthetic datasets. Our results indicate that Variant Ranker exhibits excellent performance by correctly identifying and ranking the candidate genes. CONCLUSIONS: Variant Ranker is a freely available web server at http://paschou-lab.mbg.duth.gr/Software.html . This tool will enable users to prioritize potentially causal variants and is applicable to a wide range of sequencing data.
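
    To illustrate the weighted-aggregation idea in the ranking step, here is a purely hypothetical sketch: the column names, score ranges and weights are invented for this example and do not reflect Variant Ranker's actual schema or scoring.

```python
import pandas as pd

# Hypothetical variant table; columns and values are invented for illustration.
variants = pd.DataFrame({
    "variant":      ["chr1:12345A>G", "chr2:6789C>T", "chr7:555G>A"],
    "prediction":   [0.95, 0.10, 0.70],      # aggregated deleteriousness score in [0, 1]
    "conservation": [0.90, 0.20, 0.75],      # conservation score rescaled to [0, 1]
    "allele_freq":  [0.0001, 0.12, 0.003],   # population allele frequency
})

# User-specified weights (assumed here, not Variant Ranker's defaults).
weights = {"prediction": 0.5, "conservation": 0.3, "rarity": 0.2}

scored = variants.copy()
scored["rarity"] = 1.0 - scored["allele_freq"].clip(upper=1.0)   # rarer variants score higher
scored["rank_score"] = (weights["prediction"] * scored["prediction"]
                        + weights["conservation"] * scored["conservation"]
                        + weights["rarity"] * scored["rarity"])
ranked = scored.sort_values("rank_score", ascending=False)
print(ranked[["variant", "rank_score"]])
```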

    Sketching Algorithms for Sparse Dictionary Learning: PTAS and Turnstile Streaming

    Sketching algorithms have recently proven to be a powerful approach both for designing low-space streaming algorithms as well as fast polynomial time approximation schemes (PTAS). In this work, we develop new techniques to extend the applicability of sketching-based approaches to the sparse dictionary learning and the Euclidean $k$-means clustering problems. In particular, we initiate the study of the challenging setting where the dictionary/clustering assignment for each of the $n$ input points must be output, which has surprisingly received little attention in prior work. On the fast algorithms front, we obtain a new approach for designing PTAS's for the $k$-means clustering problem, which generalizes to the first PTAS for the sparse dictionary learning problem. On the streaming algorithms front, we obtain new upper bounds and lower bounds for dictionary learning and $k$-means clustering. In particular, given a design matrix $\mathbf{A} \in \mathbb{R}^{n \times d}$ in a turnstile stream, we show an $\tilde O(nr/\epsilon^2 + dk/\epsilon)$ space upper bound for $r$-sparse dictionary learning of size $k$, an $\tilde O(n/\epsilon^2 + dk/\epsilon)$ space upper bound for $k$-means clustering, as well as an $\tilde O(n)$ space upper bound for $k$-means clustering on random order row insertion streams with a natural "bounded sensitivity" assumption. On the lower bounds side, we obtain a general $\tilde\Omega(n/\epsilon + dk/\epsilon)$ lower bound for $k$-means clustering, as well as an $\tilde\Omega(n/\epsilon^2)$ lower bound for algorithms which can estimate the cost of a single fixed set of candidate centers. Comment: To appear in NeurIPS 202
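
    The basic primitive behind turnstile-stream bounds like these is a linear sketch of the design matrix: because the sketch is linear, it can be updated entry by entry as the stream arrives. The snippet below maintains such a sketch with a dense random-sign matrix purely for illustration; the paper's constructions use more refined (and sparser) sketches plus additional machinery to recover assignments.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 1000, 20, 50              # n points in R^d, sketch dimension m << n

# Dense random-sign sketching matrix; practical turnstile algorithms use
# hash-based sketches so that each update costs O(1) instead of O(m).
S = rng.choice([-1.0, 1.0], size=(m, n)) / np.sqrt(m)

A = np.zeros((n, d))                # kept only to verify linearity below
SA = np.zeros((m, d))               # the sketch maintained by the streaming algorithm

# Turnstile stream: arbitrary additive updates (i, j, delta) to entries of A.
for _ in range(5000):
    i, j = rng.integers(n), rng.integers(d)
    delta = rng.standard_normal()
    A[i, j] += delta
    SA[:, j] += S[:, i] * delta     # O(m) sketch update, no access to A needed

assert np.allclose(SA, S @ A)       # the sketch equals S @ A by linearity
```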

    Randomized Extended Kaczmarz for Solving Least-Squares

    We present a randomized iterative algorithm that converges exponentially in expectation to the minimum Euclidean norm least-squares solution of a given linear system of equations. The expected number of arithmetic operations required to obtain an estimate of given accuracy is proportional to the squared condition number of the system multiplied by the number of non-zero entries of the input matrix. The proposed algorithm is an extension of the randomized Kaczmarz method that was analyzed by Strohmer and Vershynin. Comment: 19 Pages, 5 figures; code is available at https://github.com/zouzias/RE
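
    A compact sketch of the randomized extended Kaczmarz iteration: a column step removes the component of b outside the range of A, followed by a standard Kaczmarz row step. The fixed iteration count and the toy problem below are chosen for illustration rather than taken from the paper's stopping rule.

```python
import numpy as np

def randomized_extended_kaczmarz(A, b, num_iters=20_000, rng=None):
    """Randomized extended Kaczmarz: z tracks the part of b orthogonal to
    range(A); x converges in expectation to the minimum-norm least-squares
    solution of A x = b."""
    rng = np.random.default_rng(rng)
    m, n = A.shape
    row_p = np.linalg.norm(A, axis=1) ** 2
    col_p = np.linalg.norm(A, axis=0) ** 2
    row_p, col_p = row_p / row_p.sum(), col_p / col_p.sum()
    x, z = np.zeros(n), b.astype(float).copy()
    for _ in range(num_iters):
        j = rng.choice(n, p=col_p)                        # column step
        z -= (A[:, j] @ z) / (A[:, j] @ A[:, j]) * A[:, j]
        i = rng.choice(m, p=row_p)                        # row step
        x += (b[i] - z[i] - A[i] @ x) / (A[i] @ A[i]) * A[i]
    return x

rng = np.random.default_rng(1)
A, b = rng.standard_normal((200, 30)), rng.standard_normal(200)
x_rek = randomized_extended_kaczmarz(A, b, rng=2)
x_ls, *_ = np.linalg.lstsq(A, b, rcond=None)   # reference solution
```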

    Fast approximation of matrix coherence and statistical leverage

    The statistical leverage scores of a matrix $A$ are the squared row-norms of the matrix containing its (top) left singular vectors, and the coherence is the largest leverage score. These quantities are of interest in recently popular problems such as matrix completion and Nyström-based low-rank matrix approximation, as well as in large-scale statistical data analysis applications more generally; moreover, they are of interest since they define the key structural nonuniformity that must be dealt with in developing fast randomized matrix algorithms. Our main result is a randomized algorithm that takes as input an arbitrary $n \times d$ matrix $A$, with $n \gg d$, and that returns as output relative-error approximations to all $n$ of the statistical leverage scores. The proposed algorithm runs (under assumptions on the precise values of $n$ and $d$) in $O(nd \log n)$ time, as opposed to the $O(nd^2)$ time required by the naïve algorithm that involves computing an orthogonal basis for the range of $A$. Our analysis may be viewed in terms of computing a relative-error approximation to an underconstrained least-squares approximation problem, or, relatedly, it may be viewed as an application of Johnson-Lindenstrauss type ideas. Several practically important extensions of our basic result are also described, including the approximation of so-called cross-leverage scores, the extension of these ideas to matrices with $n \approx d$, and the extension to streaming environments. Comment: 29 pages; conference version is in ICML; journal version is in JML
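
    For intuition, the sketch below contrasts the exact $O(nd^2)$ computation with a sketch-based approximation in the same spirit, except that it uses dense Gaussian projections rather than the fast Hadamard-based transforms that give the stated $O(nd \log n)$ running time; the sketch sizes are chosen arbitrarily.

```python
import numpy as np

def exact_leverage_scores(A):
    """Squared row norms of an orthonormal basis for range(A): O(n d^2)."""
    Q, _ = np.linalg.qr(A)
    return np.sum(Q ** 2, axis=1)

def approx_leverage_scores(A, sketch_rows=200, jl_cols=50, rng=None):
    """Approximate scores via a subspace-embedding sketch plus a JL projection.
    Dense Gaussian matrices are used here for simplicity."""
    rng = np.random.default_rng(rng)
    n, d = A.shape
    S = rng.standard_normal((sketch_rows, n)) / np.sqrt(sketch_rows)
    _, R = np.linalg.qr(S @ A)                 # d x d factor from the sketched matrix
    G = rng.standard_normal((d, jl_cols)) / np.sqrt(jl_cols)
    X = A @ np.linalg.solve(R, G)              # approximates A R^{-1}, then JL-compressed
    return np.sum(X ** 2, axis=1)

A = np.random.default_rng(0).standard_normal((5000, 20))
ell_exact = exact_leverage_scores(A)
ell_approx = approx_leverage_scores(A, rng=1)
```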

    Solving $k$-means on High-dimensional Big Data

    In recent years, there have been major efforts to develop data stream algorithms that process inputs in one pass over the data with little memory requirement. For the $k$-means problem, this has led to the development of several $(1+\varepsilon)$-approximations (under the assumption that $k$ is a constant), but also to the design of algorithms that are extremely fast in practice and compute solutions of high accuracy. However, when not only the length of the stream but also the dimensionality of the input points is high, current methods reach their limits. We propose two algorithms, piecy and piecy-mr, that are based on the recently developed data stream algorithm BICO and that can process high-dimensional data in one pass and output a solution of high quality. While piecy is suited for high-dimensional data with a medium number of points, piecy-mr is meant for high-dimensional data that comes in a very long stream. We provide an extensive experimental study to evaluate piecy and piecy-mr that shows the strength of the new algorithms. Comment: 23 pages, 9 figures, published at the 14th International Symposium on Experimental Algorithms - SEA 201
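
    As a rough stand-in for the projection-plus-streaming idea (not the actual piecy/piecy-mr pipeline, which relies on BICO coresets), the sketch below projects each chunk of a high-dimensional stream to a low dimension and runs a simple one-pass online k-means update; all names and parameter values are illustrative.

```python
import numpy as np

def streaming_kmeans_projected(stream, d, k, proj_dim, rng=None):
    """Project each incoming chunk of points to proj_dim dimensions, then run a
    simple one-pass online k-means update (a stand-in for the BICO coreset
    stream used by piecy and piecy-mr)."""
    rng = np.random.default_rng(rng)
    P = rng.standard_normal((d, proj_dim)) / np.sqrt(proj_dim)  # JL-style projection
    centers, counts = None, np.zeros(k)
    for chunk in stream:                       # chunk: (batch, d) array of points
        Y = chunk @ P
        if centers is None:
            centers = Y[:k].copy()             # seed with the first k projected points
            counts[:] = 1
            Y = Y[k:]
        for y in Y:
            j = np.argmin(np.sum((centers - y) ** 2, axis=1))
            counts[j] += 1
            centers[j] += (y - centers[j]) / counts[j]   # running-mean update
    return centers

# toy stream: 20 chunks of 200 points each in 10,000 dimensions
rng = np.random.default_rng(0)
d, k = 10_000, 5
means = rng.standard_normal((k, d)) * 3
stream = (means[rng.integers(k, size=200)] + rng.standard_normal((200, d))
          for _ in range(20))
centers = streaming_kmeans_projected(stream, d=d, k=k, proj_dim=50)
```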

    Near Optimal Linear Algebra in the Online and Sliding Window Models

    We initiate the study of numerical linear algebra in the sliding window model, where only the most recent $W$ updates in a stream form the underlying data set. We first introduce a unified row-sampling based framework that gives randomized algorithms for spectral approximation, low-rank approximation/projection-cost preservation, and $\ell_1$-subspace embeddings in the sliding window model, which often use nearly optimal space and achieve nearly input-sparsity runtime. Our algorithms are based on "reverse online" versions of offline sampling distributions such as (ridge) leverage scores, $\ell_1$ sensitivities, and Lewis weights to quantify both the importance and the recency of a row. Our row-sampling framework rather surprisingly implies connections to the well-studied online model; our structural results also give the first sample-optimal (up to lower order terms) online algorithm for low-rank approximation/projection-cost preservation. Using this powerful primitive, we give online algorithms for column/row subset selection and principal component analysis that resolve the main open question of Bhaskara et al. (FOCS 2019). We also give the first online algorithm for $\ell_1$-subspace embeddings. We further formalize the connection between the online model and the sliding window model by introducing an additional unified framework for deterministic algorithms using a merge-and-reduce paradigm and the concept of online coresets. Our sampling-based algorithms in the row-arrival online model yield online coresets, giving deterministic algorithms for spectral approximation, low-rank approximation/projection-cost preservation, and $\ell_1$-subspace embeddings in the sliding window model that use nearly optimal space.
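
    The "online" half of such a framework can be illustrated with online ridge leverage score row sampling: each arriving row is kept with probability proportional to its leverage against everything seen so far and rescaled for unbiasedness. This is a simplified sketch of that single primitive (with an arbitrary ridge parameter and oversampling factor), not the sliding-window or merge-and-reduce machinery.

```python
import numpy as np

def online_ridge_leverage_sampling(rows, d, lam=1e-3, oversample=10.0, rng=None):
    """Keep each arriving row with probability proportional to its ridge
    leverage score against the rows seen so far; rescale kept rows so the
    sample spectrally approximates the full matrix in expectation."""
    rng = np.random.default_rng(rng)
    M = lam * np.eye(d)                          # running A_seen^T A_seen + lam * I
    kept = []
    for a in rows:
        tau = float(a @ np.linalg.solve(M, a))   # online ridge leverage score
        p = min(1.0, oversample * tau)
        if rng.random() < p:
            kept.append(a / np.sqrt(p))          # rescale for unbiasedness
        M += np.outer(a, a)
    return np.array(kept)

rng = np.random.default_rng(0)
A = rng.standard_normal((2000, 15))
S = online_ridge_leverage_sampling(A, d=15)
# S.T @ S should approximate A.T @ A using far fewer rows than A has
```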

    On landmark selection and sampling in high-dimensional data analysis

    In recent years, the spectral analysis of appropriately defined kernel matrices has emerged as a principled way to extract the low-dimensional structure often prevalent in high-dimensional data. Here we provide an introduction to spectral methods for linear and nonlinear dimension reduction, emphasizing ways to overcome the computational limitations currently faced by practitioners with massive datasets. In particular, a data subsampling or landmark selection process is often employed to construct a kernel based on partial information, followed by an approximate spectral analysis termed the Nyström extension. We provide a quantitative framework to analyse this procedure, and use it to demonstrate algorithmic performance bounds on a range of practical approaches designed to optimize the landmark selection process. We compare the practical implications of these bounds by way of real-world examples drawn from the field of computer vision, whereby low-dimensional manifold structure is shown to emerge from high-dimensional video data streams. Comment: 18 pages, 6 figures, submitted for publicatio
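
    To ground the Nyström extension being analysed here, the following sketch builds the landmark kernel, eigendecomposes it, and extends the eigenvectors to all points. The RBF kernel, the uniform landmark choice, and all parameter values are assumptions made for illustration; the point of the paper is precisely that smarter landmark selection can do better than this uniform baseline.

```python
import numpy as np

def rbf_kernel(X, Y, gamma):
    """Gaussian (RBF) kernel matrix between two point sets."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def nystrom_embedding(X, landmark_idx, gamma, k):
    """Nystrom extension: eigendecompose the small landmark kernel and extend
    its top-k eigenvectors to all n points. Landmark selection is whatever
    index set the caller passes in (uniform, leverage-based, etc.)."""
    L = X[landmark_idx]
    K_mm = rbf_kernel(L, L, gamma)                  # m x m landmark kernel
    K_nm = rbf_kernel(X, L, gamma)                  # n x m cross kernel
    evals, evecs = np.linalg.eigh(K_mm)
    evals, evecs = evals[::-1][:k], evecs[:, ::-1][:, :k]   # top-k eigenpairs
    # extend eigenvectors to all points: u_i proportional to K_nm v_i / lambda_i
    return K_nm @ (evecs / np.maximum(evals, 1e-12))

rng = np.random.default_rng(0)
X = rng.standard_normal((3000, 8))
landmarks = rng.choice(len(X), size=100, replace=False)   # uniform landmarks as a baseline
U = nystrom_embedding(X, landmarks, gamma=0.5, k=10)
```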